UMBC High Performance Computing Facility : HPC Frequently Asked Questions
This page last changed on May 08, 2009 by straha1.
Questions about Getting an Account

How do I get an account on HPC?
Fill out our account request form. All new users must fill out that page.

How do I get an account on another one of UMBC's clusters?
See the respective websites for the other clusters.

How do I join the financing of HPC?
Contact our HPC Point of Contact, Dr. Matthias K. Gobbert.

Your cluster cannot fulfill my needs. What other options do I have?
If your concerns are about lack of software, contact our HPCF Point of Contact, Dr. Matthias K. Gobbert, about adding new software; that is usually not a problem. If our cluster's hardware limitations are the concern, one of our other clusters may suit your needs. Also, we are planning on expanding HPC's computing and file storage capabilities. If you wish to inquire about or financially contribute to that effort, contact our HPCF Point of Contact.

If you need significantly more resources than our cluster can provide, you should consider applying for an account on Teragrid, the NSF terascale computing infrastructure. It consists of over a dozen clusters all over the US, numerous web resources and consulting services. It is described by the Teragrid about page as "an open scientific discovery infrastructure combining leadership class resources at eleven partner sites to create an integrated, persistent computational resource." See our page about getting access to Teragrid for useful Teragrid-related links, an explanation of the process of obtaining a Teragrid account, and tips about how to write your Teragrid proposal.

Questions Relating to Jobs

Why do my jobs always randomly quit after four hours?
By default, the maximum run length of jobs is four hours. This is to prevent new users from accidentally running jobs that take up the whole cluster for an entire day.
Your jobs can have a longer maximum run time (called the wall clock limit) if you set it using the -l walltime= option as described on this page:

How do I run interactive jobs on the cluster nodes?
You can't. Interactive jobs violate our usage policies. You should run any interactive programs on the head node. If you need to run them on the cluster nodes, contact our HPCF Point of Contact.

How do I submit jobs?
That depends on the type of job. See our HPC Compilation and Job Submission Tutorial for details on submitting many common types of jobs. If you cannot figure out how to do what you are trying to do, contact us.

How do I cancel a job?
Use the qdel program described here.

I submitted a job but I just realized I made a mistake that makes that job useless to me. What should I do?
Please cancel your job. This is important because it will increase the number of processors available to other users and yourself. To cancel your job, use the qdel program described here.

My job has been in the queue forever in the "Q" state. Why does it never run?
That is probably because either there are not enough nodes available to run your job or the cluster's scheduler has decided to run other jobs before yours. You can reduce the chances of that by using methods described here. However, it is possible that someone else's job has gotten stuck, or that there is another problem on the cluster. If you suspect that may be the case, run qstat:

    username@hpc:~> qstat
    Job id              Name             User            Time Use S Queue
    ------------------- ---------------- --------------- -------- - -----
    3101.hpc            PossiblyHungJob  errorman        00:00:31 E low_priority
    3140.hpc            MyHomework       johndoe         00:00:00 R low_priority
    3141.hpc            Cell2304A        janedoe         00:00:00 R low_priority
    3142.hpc            ComputePi        pieeater        00:00:00 Q high_priority

If qstat lists many jobs whose state ("S" column) is "R" or "Q" then there are probably no problems on the cluster; there are just a lot of jobs taking up nodes.
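Reading the "S" column by eye gets tedious when the queue is long. The following is a minimal sketch of tallying jobs per state. It assumes qstat prints the six-column layout shown in the sample (two header lines, state in the fifth column). count_job_states is a hypothetical helper name, not a tool installed on HPC, and the demo input is the sample listing rather than live output.

```shell
#!/bin/sh
# Tally jobs per state from qstat-style output read on standard input.
# Assumption: the first two lines are headers and the state is field 5,
# as in the sample listing; adjust if your qstat output differs.
count_job_states() {
    awk 'NR > 2 && NF >= 6 { count[$5]++ }
         END { for (s in count) print s, count[s] }' | sort
}

# Demo on the sample listing (on the cluster: qstat | count_job_states).
count_job_states <<'EOF'
Job id              Name             User            Time Use S Queue
------------------- ---------------- --------------- -------- - -----
3101.hpc            PossiblyHungJob  errorman        00:00:31 E low_priority
3140.hpc            MyHomework       johndoe         00:00:00 R low_priority
3141.hpc            Cell2304A        janedoe         00:00:00 R low_priority
3142.hpc            ComputePi        pieeater        00:00:00 Q high_priority
EOF
```

A tally containing only "R" and "Q" entries suggests the cluster is simply busy. Any other state letter that persists is worth a closer look.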
If a job has been in the "R" state for most of the day, or if you see jobs that are in states other than "Q" or "R" for more than a few seconds, then something is wrong. If so, contact us.

My job should have finished, but it is stuck in the "R" state.
If you are sure your job has finished, but you find it in the "R" state, it may be due to an error in your job or an error in HPC's job execution software. First try to use qdel to kill the job. Instructions are here. Your job should then exit and be removed from the queue within a few minutes. You can check whether your job is still in the queue by using the qstat command. If your job is still stuck in the "R" state or another state after a few minutes, then there is probably a system error, and perhaps even a hardware failure. The nodes your job was using will most likely remain unavailable for a while, so this is an urgent problem. Contact our support personnel right away and we will contact the right system administrators for you.

My job is stuck in a state other than "R" or "Q".
If your job is in an unexpected state for more than a minute or two, then that is probably due to a system error, perhaps even a hardware failure. The nodes your job was using will most likely remain unavailable for a while, so this is an urgent problem. Contact our support personnel right away and we will contact the right system administrators for you.

How do I determine what jobs are running on the cluster?
Use the qstat command described here.

How do I get detailed information about my jobs?
Use the qstat command described here.

Questions Relating to Software on HPC

I need software that is not on HPC. What do I do?
Send an email to our support personnel.
What compilers and MPI implementations are available on HPC?
Your options, and advice on choosing between them, are on this page:

How do I use COMSOL on HPC?
COMSOL is installed on HPC, but we have not come up with the best way to execute it yet. The complication is that the way in which COMSOL is run has just changed due to a lawsuit between MathWorks (the makers of Matlab) and the COMSOL Group. Contact our support personnel if you want more information.

How do I use IDL, Matlab, R or SAS on HPC?
Instructions are on these pages:

Questions Relating to File Storage

Where can I store files on HPC?
See these pages for details on this issue: Disk Space and File Quotas and Initial Setup of Your Account.

I am unable to create new files in my home directory. What is going on?
Your home directory has a quota: limits on the total storage space you can use and the number of files you can create. If you go past these limits (over your quota) then you can no longer create files or make existing files larger. The quotas are very low since the medium that stores home directories is small. You have other storage options available. To learn about home directory quotas and your other storage options, see this page:

Why am I unable to create or delete files in my scratch or common partition?
You probably ran out of space. Your common and scratch partitions do not have simple quotas like your home directory does. The reason for this is the method that is used to store archives of your common and scratch data. You have two separate space limitations. One is the total amount of storage space left in what has been officially allocated to you, and the other is a bit more complicated. All data in common and scratch areas are stored on machines usually called "thumpers." These thumpers use the XFS filesystem, which stores not only your files, but also previous files that were deleted. When you delete files, XFS will archive them for up to ten days.
Your research group has some amount of storage space officially allocated to it (usually 100 GB). In addition to that, you also have a limit on the amount of storage space that can be used for archived files. If you go over your official space limitation, you will be unable to create files. If you go over your limit for archived files, you will be unable to delete files. The reason is that the thumpers will try to store an archived copy of the file that you are trying to delete. If the file is too large, the deletion will fail. Usually, you will need to contact a system administrator to fix that problem. (See our Contact Information page.)

However, sometimes you can fix the problem yourself by shrinking large files down to almost zero size and then deleting them. For example, suppose you have a 100 GB file called too-big.file that you created by accident. You are unable to delete it since, with all the other files you have created and deleted, you have used up too much space and the thumpers cannot store an archived copy of too-big.file. Instead of merely deleting it, do this:

    echo > too-big.file
    rm too-big.file

The first command replaces the contents of too-big.file with a single end-of-line character. Thus the size of the file shrinks from 100 GB to one byte. The second command removes that file. The thumpers only have to store a copy of the one-byte version of the file.

Questions Relating to Connecting to HPC

I tried to send a message to an HPC mailing list (hpc-support or hpc-users) but I was unable to. What is going on?
You are probably getting a message like this:

    Date: Mon, 12 Jan 2009 23:43:13 -0500 (EST)
    From: UMBC Mailing List Manager <sympa@lists.umbc.edu>
    To: (your email address here)
    Subject: Message diffusion: Authorization reject

    Your message addressed to [Old contact info removed] was rejected.
    You are not allowed to send this message for the following reason:

    Message diffusion in the list is restricted to local domain users.

Any messages sent to hpc-support or hpc-users must be sent from an address within UMBC. If you are sending the email from a UMBC email address and still cannot send, try to contact the UMBC helpdesk. The page that link sends you to includes phone numbers, an email address and a physical location.

How do I transfer files to and from HPC?
See this page: How to Copy Files to and from HPC.
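The linked page is the authoritative guide to file transfer. As a rough sketch only, and assuming the cluster accepts SSH-based copies with scp (this FAQ does not confirm that), the commands would take the shape below. HPC_USER, HPC_HOST, upload_cmd and download_cmd are hypothetical names, and hpc.example.edu is a placeholder rather than the cluster's real address. The script prints the commands instead of running them, so you can check them first.

```shell
#!/bin/sh
# Dry-run sketch: print scp commands rather than executing them.
# HPC_USER and HPC_HOST are placeholders, not a real account or hostname.
HPC_USER="${HPC_USER:-username}"
HPC_HOST="${HPC_HOST:-hpc.example.edu}"

# Print the command that copies a local file ($1) to a cluster path ($2).
upload_cmd() {
    printf 'scp %s %s@%s:%s\n' "$1" "$HPC_USER" "$HPC_HOST" "$2"
}

# Print the command that copies a cluster file ($1) to a local path ($2).
download_cmd() {
    printf 'scp %s@%s:%s %s\n' "$HPC_USER" "$HPC_HOST" "$1" "$2"
}

upload_cmd input.dat input.dat
download_cmd results.out .
```

Once the printed commands look right for your account, run them directly in your local terminal.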
Document generated by Confluence on Mar 31, 2011 15:37